Game Analytics: From Bootstrapping to Predictive Modeling

Author

Hoang Son Lai

Published

November 17, 2025

1. Data Overview & Preprocessing

Code
# Load libraries and data
library(dplyr)
library(gt)

game_data <- read.csv("data/game_sessions.csv", stringsAsFactors = FALSE)

# Data cleaning and preprocessing
game_data_clean <- game_data %>%
  mutate(
    start_time = as.POSIXct(start_time, format = "%Y-%m-%dT%H:%M:%OSZ"),
    end_time = as.POSIXct(end_time, format = "%Y-%m-%dT%H:%M:%OSZ"),
    death_reason = as.factor(death_reason),
    # Treat missing game_duration as zero-length sessions
    game_duration = ifelse(is.na(game_duration), 0, game_duration),
    # Create performance metrics
    score_per_second = ifelse(game_duration > 0, score / game_duration, 0),
    coins_per_pipe = ifelse(pipes_passed > 0, coins_collected / pipes_passed, 0),
    accuracy = ifelse(bullets_fired > 0, ufos_shot / bullets_fired, 0)
  ) %>%
  filter(!is.na(start_time))  # Remove incomplete records

# Display basic information
cat("Dataset Dimensions:", dim(game_data_clean), "\n")
Dataset Dimensions: 300 13 
Code
cat("Date Range:", as.character(min(game_data_clean$start_time)), "to", 
    as.character(max(game_data_clean$start_time)), "\n")
Date Range: 2025-11-16 12:33:32.897 to 2025-11-17 04:58:20.094 
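Before moving on, a few lightweight sanity checks can guard the derived metrics. A minimal sketch, assuming `game_data_clean` is in scope (the invariants below are assumptions about the data, not guarantees from the cleaning code):

```r
# Fail fast if the cleaning step left inconsistent values behind
stopifnot(
  !any(is.na(game_data_clean$start_time)),
  all(game_data_clean$game_duration >= 0),
  all(is.finite(game_data_clean$score_per_second))
)
```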
Code
# Summary statistics
summary_stats <- game_data_clean %>%
  select(score, game_duration, coins_collected, ufos_shot, bullets_fired, pipes_passed) %>%
  summary()

# Display table
game_data_display <- game_data_clean %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

game_data_display %>%
  head(10) %>%
  gt() %>%
  tab_header(
    title = "Game Session Data — Preview (10 rows)"
  ) %>%
  opt_table_font(
    font = google_font("Roboto")
  ) %>%
  cols_align(
    align = "center",
    columns = everything()
  ) %>%
  tab_options(
    table.width = pct(100),
    column_labels.padding = px(6),
    data_row.padding = px(6),
    table.font.size = px(14)
  )
Game Session Data — Preview (10 rows)
id start_time end_time score coins_collected ufos_shot bullets_fired death_reason game_duration pipes_passed score_per_second coins_per_pipe accuracy
plane_1763296412897 2025-11-16 12:33:32.897 2025-11-16 12:33:40.65 8 2 2 33 pipe 7 3 1.14 0.67 0.06
plane_1763296421212 2025-11-16 12:33:41.212 2025-11-16 12:33:44.999 6 0 2 31 pipe 3 1 2.00 0.00 0.06
plane_1763296425226 2025-11-16 12:33:45.226 2025-11-16 12:33:45.949 0 0 0 0 ground 0 0 0.00 0.00 0.00
plane_1763296426741 2025-11-16 12:33:46.741 2025-11-16 12:33:47.465 0 0 0 0 ground 0 0 0.00 0.00 0.00
plane_1763296427702 2025-11-16 12:33:47.702 2025-11-16 12:33:48.415 0 0 0 0 ground 0 0 0.00 0.00 0.00
plane_1763296428948 2025-11-16 12:33:48.948 2025-11-16 12:33:49.665 0 0 0 0 ground 0 0 0.00 0.00 0.00
plane_1763296429950 2025-11-16 12:33:49.95 2025-11-16 12:33:54.182 6 0 2 25 pipe 4 2 1.50 0.00 0.08
plane_1763296435012 2025-11-16 12:33:55.012 2025-11-16 12:33:55.732 0 0 0 0 ground 0 0 0.00 0.00 0.00
plane_1763296435967 2025-11-16 12:33:55.967 2025-11-16 12:33:56.699 0 0 0 0 ground 0 0 0.00 0.00 0.00
plane_1763296437259 2025-11-16 12:33:57.259 2025-11-16 12:34:01.782 7 1 2 41 pipe 4 2 1.75 0.50 0.05

2. Exploratory Data Analysis

In this section, I analyze the distribution of key metrics and investigate relationships between variables to understand player behavior before modeling.

2.1 Distribution of Key Metrics

Code
library(ggplot2)
library(gridExtra)

# Histogram of Scores
p1 <- ggplot(game_data_clean, aes(x = score)) +
  geom_histogram(binwidth = 5, fill = "#4e79a7", color = "white", alpha = 0.8) +
  labs(title = "Distribution of Player Scores", x = "Score", y = "Count") +
  theme_minimal()

# Histogram of Game Duration
p2 <- ggplot(game_data_clean, aes(x = game_duration)) +
  geom_histogram(binwidth = 5, fill = "#f28e2b", color = "white", alpha = 0.8) +
  labs(title = "Distribution of Game Duration", x = "Duration (seconds)", y = "Count") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)

2.2 Death Reason Analysis

Understanding why players fail is crucial for adjusting game difficulty.

Code
# Bar chart for death reasons
game_data_clean %>%
  filter(!is.na(death_reason)) %>%
  count(death_reason, sort = TRUE) %>%
  mutate(death_reason = reorder(death_reason, n)) %>%
  ggplot(aes(x = death_reason, y = n, fill = death_reason)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Common Causes of Death", x = "Death Reason", y = "Frequency") +
  theme_minimal()

2.3 Correlation Matrix

We check for correlations between numeric variables to identify potential predictors.

Code
library(corrplot)

# Select numeric columns for correlation
num_vars <- game_data_clean %>%
  select(score, game_duration, coins_collected, ufos_shot, bullets_fired, pipes_passed)

cor_matrix <- cor(num_vars, use = "complete.obs")

corrplot(cor_matrix, method = "color", type = "upper", 
         addCoef.col = "black", tl.col = "black", diag = FALSE,
         title = "Feature Correlation Matrix", mar = c(0,0,1,0))
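Strongly correlated pairs can also be extracted programmatically rather than read off the plot. A minimal base-R sketch using the `cor_matrix` computed above (the 0.8 threshold is an arbitrary choice):

```r
# Extract variable pairs with |r| above a threshold from the upper triangle
high_idx <- which(abs(cor_matrix) > 0.8 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_matrix)[high_idx[, 1]],
  var2 = colnames(cor_matrix)[high_idx[, 2]],
  r    = round(cor_matrix[high_idx], 2)
)
```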

3. Bootstrapping Data for Machine Learning

Since the original dataset is small (about 300 rows), we bootstrap the training set by sampling observed rows with replacement, producing a larger training set of 10,000 samples. Resampling cannot add information beyond the underlying rows; it only repeats observed patterns, so downstream results should be read with that caveat. We reserve the last 50 records as a strict holdout test set.

Code
set.seed(123)

# 1. Split real data into Train (first 250) and Test (last 50)
# Sorting by start_time ensures we respect temporal order
game_sorted <- game_data_clean %>% arrange(start_time)
train_base <- head(game_sorted, 250)
test_holdout <- tail(game_sorted, 50)

# 2. Bootstrap the training data to 10,000 samples
# Sampling with replacement allows us to simulate a larger dataset based on observed patterns
bootstrap_size <- 10000
train_bootstrapped <- train_base %>%
  slice_sample(n = bootstrap_size, replace = TRUE) %>%
  mutate(is_synthetic = TRUE) # Flag resampled rows (duplicates of real observations)

# All subsequent models are trained on 'train_bootstrapped'
cat("Original Train Size:", nrow(train_base), "\n")
Original Train Size: 250 
Code
cat("Bootstrapped Train Size:", nrow(train_bootstrapped), "\n")
Bootstrapped Train Size: 10000 
Code
cat("Holdout Test Size:", nrow(test_holdout), "\n")
Holdout Test Size: 50 
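Because bootstrapping only resamples observed rows, the resampled set should mirror the original training distribution. A quick check, assuming `train_base` and `train_bootstrapped` are in scope:

```r
# Means should be close; large gaps would indicate a resampling bug
rbind(
  original     = colMeans(train_base[, c("score", "game_duration", "coins_collected")]),
  bootstrapped = colMeans(train_bootstrapped[, c("score", "game_duration", "coins_collected")])
)
```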

4. Player Segmentation (Clustering)

We use K-Means clustering to identify distinct player personas based on their performance metrics.

Code
library(cluster)
library(factoextra)

# Select features for clustering
cluster_features <- train_bootstrapped %>%
  select(score, coins_collected, pipes_passed, ufos_shot, bullets_fired, game_duration)

# Scale the data
scaled_features <- scale(cluster_features)

# Choose k = 3 for broad segmentation (Beginners, Average, Pros);
# a formal elbow analysis is omitted here for brevity
set.seed(123)
kmeans_model <- kmeans(scaled_features, centers = 3, nstart = 25)

# Add cluster labels back to the data
train_bootstrapped$cluster <- as.factor(kmeans_model$cluster)

# Visualize Clusters
fviz_cluster(kmeans_model, data = scaled_features,
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_minimal(),
             main = "Player Segmentation (K-Means)")

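The choice of k = 3 can be cross-checked with an explicit elbow plot. A sketch using factoextra on the `scaled_features` matrix (the report's k = 3 remains a judgment call, not the output of this code):

```r
# Within-cluster sum of squares for k = 1..8; look for the "elbow"
set.seed(123)
fviz_nbclust(scaled_features, kmeans, method = "wss", k.max = 8) +
  labs(title = "Elbow Method for Choosing k")
```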
Code
# Summary of clusters
train_bootstrapped %>%
  group_by(cluster) %>%
  summarise(
    Avg_Score = mean(score),
    Avg_Duration = mean(game_duration),
    Avg_Coins = mean(coins_collected),
    Count = n()
  ) %>%
  gt() %>%
  tab_header(title = "Cluster Profiles")
Cluster Profiles
cluster Avg_Score Avg_Duration Avg_Coins Count
1 32.35764 23.952217 10.8684729 2030
2 4.72283 3.483075 0.9367524 7858
3 116.93750 78.857143 45.8482143 112

5. Predictive Modeling

5.1 Score Forecasting (Regression)

We use a Random Forest model to predict the final score from gameplay metrics. Several predictors (e.g., pipes_passed) are mechanically tied to scoring, so high accuracy is expected; the main interest here is the relative feature importance.

Code
library(randomForest)
library(caret)

# Define features
features <- c("coins_collected", "ufos_shot", "bullets_fired", "game_duration", "pipes_passed")

# Train Random Forest on Bootstrapped Data
rf_model_score <- randomForest(
  as.formula(paste("score ~", paste(features, collapse = "+"))),
  data = train_bootstrapped,
  ntree = 100,
  importance = TRUE
)

# Predict on Holdout Test Set
predictions_rf <- predict(rf_model_score, newdata = test_holdout)

# Evaluate Performance (RMSE & R-squared)
rmse_val <- RMSE(predictions_rf, test_holdout$score)
r2_val <- R2(predictions_rf, test_holdout$score)

cat("Random Forest Performance on Holdout Set:\n")
Random Forest Performance on Holdout Set:
Code
cat("RMSE:", round(rmse_val, 2), "\n")
RMSE: 1.8 
Code
cat("R-Squared:", round(r2_val, 4), "\n")
R-Squared: 0.9818 
Code
# Variable Importance Plot
varImpPlot(rf_model_score, main = "Feature Importance for Score Prediction")

5.2 Survival Analysis (Logistic Regression)

We predict whether a player will survive past a specific “expert” threshold (e.g., 30 seconds). This is a binary classification problem.

Code
# Define 'Survival' as lasting longer than 30 seconds
threshold <- 30

train_bootstrapped <- train_bootstrapped %>%
  mutate(survived_expert = as.factor(ifelse(game_duration > threshold, 1, 0)))

test_holdout <- test_holdout %>%
  mutate(survived_expert = as.factor(ifelse(game_duration > threshold, 1, 0)))

# Train Logistic Regression
# Exclude duration- and score-derived variables to prevent data leakage;
# use only behavioral counts as predictors
log_model <- glm(survived_expert ~ bullets_fired + ufos_shot + coins_collected, 
                 data = train_bootstrapped, 
                 family = "binomial")

# Predict probabilities on Test Set
probs_survival <- predict(log_model, newdata = test_holdout, type = "response")
preds_survival <- factor(ifelse(probs_survival > 0.5, 1, 0), levels = c(0, 1))

# Confusion matrix (fixed factor levels keep this well-defined even if
# only one class is predicted)
conf_matrix <- confusionMatrix(preds_survival, test_holdout$survived_expert)

cat("Logistic Regression Accuracy:", round(conf_matrix$overall['Accuracy'], 4), "\n")
Logistic Regression Accuracy: 0.96 
Code
print(conf_matrix$table)
          Reference
Prediction  0  1
         0 47  1
         1  1  1
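A caveat on the 96% accuracy: the holdout set is heavily imbalanced (48 of 50 sessions are non-survivors), so a model that always predicts 0 would also score 96%. Class-aware metrics from the caret object already computed give a fairer picture:

```r
# Sensitivity, specificity, and balanced accuracy adjust for the base rate
round(conf_matrix$byClass[c("Sensitivity", "Specificity", "Balanced Accuracy")], 3)
```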

6. Business Insights & Recommendations

Based on the analysis above, we derive the following actionable insights:

6.1. Difficulty Balancing:

Observation: The death_reason analysis highlights the most common obstacles (e.g., pipes vs. enemies). If ‘pipe’ collisions are disproportionately high early in the game, the initial difficulty curve may be too steep.

Recommendation: Adjust the gap size or spawn rate of the leading cause of death in the first 10 seconds of gameplay to improve retention.
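The premise of this recommendation can be verified directly from the data. A sketch, assuming `game_data_clean` is in scope (the 10-second cutoff mirrors the recommendation above):

```r
# Death causes among sessions that ended within the first 10 seconds
game_data_clean %>%
  filter(game_duration <= 10) %>%
  count(death_reason, sort = TRUE)
```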

6.2. Player Segmentation Strategy:

Observation: K-Means clustering identified distinct groups (see the cluster table), e.g., high-duration/low-coin collectors vs. aggressive shooters.

Recommendation: Introduce targeted rewards.

  • For ‘Survivors’ (High duration, low action): Introduce time-based achievements.

  • For ‘Shooters’ (High bullets/UFOs): Offer weapon skins or visual upgrades for combat milestones.

6.3. Predictive Engagement:

Observation: The Random Forest model shows that specific actions (like coins_collected or ufos_shot) are strong predictors of high scores.

Recommendation: Create a tutorial or “Daily Mission” focusing on these high-value actions to teach new players how to achieve higher scores effectively.

6.4. Monetization Opportunities:

Observation: Players who survive past the 30-second threshold (analyzed in the Logistic Regression) show higher engagement.

Recommendation: Trigger “Continue?” ads or special offers only after a player has demonstrated this “expert” survival trait, as they are more invested in the session than a player who dies instantly.
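As a sketch of how the fitted logistic model could drive this trigger in practice, the helper below is hypothetical (the function name, the 0.5 cutoff, and the idea of scoring live sessions are assumptions, not part of the analysis above):

```r
# Score a session's behavioral counts and decide whether to show an offer.
# 'session' must contain bullets_fired, ufos_shot, and coins_collected.
should_show_offer <- function(session, model = log_model, cutoff = 0.5) {
  p <- predict(model, newdata = session, type = "response")
  unname(p > cutoff)
}

# Example call with made-up counts:
# should_show_offer(data.frame(bullets_fired = 40, ufos_shot = 5, coins_collected = 12))
```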